For this project, I am wondering whether the different in people’s income would influence the death cases by Covid-19in the US. For the first dateset, I choose to use Median Income for each state in the US provided by United State Census and the link is ‘https://www.census.gov/search-results.html?q=Median+income+&page=1&stateGeo=none&searchtype=web&cssp=SERP&_charset_=UTF-8’. For the second dateset, I choose to use the collection of Covid-19 cases and all-causes death cases in each state and county in the US provided by the CDC and the link is ‘https://data.cdc.gov/NCHS/Provisional-COVID-19-Death-Counts-in-the-United-St/kn79-hsxy’
# INITALIZE CODE CHUNK OPTIONS
opts_chunk$set(
warning = FALSE,
message = FALSE,
eval=TRUE,
echo = TRUE,
cache = FALSE,
fig.width = 7,
fig.align = 'center',
fig.asp = 0.618,
out.width = "700px",
class.source = "code-r")
I need to merge those two datasets by the variable ‘State’ to get a full dataset which is helpful for the further analysis. Then, I delete the comma occurred in some numerical number such as changing 14,500 to 14500 in order to better run the data in R. For the next step, I renamed certain variables that include ‘space’ like changing “urban rural code” to “urban_rual_code” as a whole word. Before providing some statistical result, the most important step is to check the missing value occurs in our data. For any observations with the missing value for the death cases, I just replaced them with 0. In order to better summary the key outcome by the variable state, I created new variables to reflect the total death cases in each state. Then, I created a table to show the details of each key variable. The table contains four variables which classified by State: number of counties, Income, COVID-19 death cases and all-caused death cases. For the data visualization, I plotted 4 graphs to show the association between each key variables. For example, I used draw a US map to show the density of COVID-19 death in each state and draw a scatter plot to reflect the linear association between Income and number of Covid-19 death cases.
income <- data.table::fread("/Users/websterj/Desktop/median income.csv")
covid <- data.table::fread("/Users/websterj/Desktop/Covid.csv")
Merge two dataset
covid1 <- merge(
x = income,
y = covid,
all.x = TRUE, all.y = FALSE,
by= "State"
)
covid1<-data.table(covid1)
#check the dimension of the dataset
dim(covid1)
## [1] 3023 14
nrow(covid1)
## [1] 3023
ncol(covid1)
## [1] 14
# To know the contents of data
str(covid1)
## Classes 'data.table' and 'data.frame': 3023 obs. of 14 variables:
## $ State : chr "AK" "AK" "AK" "AK" ...
## $ State_full_name : chr "Alaska" "Alaska" "Alaska" "Alaska" ...
## $ Income : int 77640 77640 77640 77640 77640 77640 77640 77640 77640 77640 ...
## $ Margin Of Error : chr "+/- $1,015" "+/- $1,015" "+/- $1,015" "+/- $1,015" ...
## $ Lon : num 61.4 61.4 61.4 61.4 61.4 ...
## $ Lat : num -152 -152 -152 -152 -152 ...
## $ Date as of : chr "10/20/21" "10/20/21" "10/20/21" "10/20/21" ...
## $ Start Date : chr "1/1/20" "1/1/20" "1/1/20" "1/1/20" ...
## $ End Date : chr "10/16/21" "10/16/21" "10/16/21" "10/16/21" ...
## $ County name : chr "Aleutians East Borough" "Anchorage Municipality" "Bethel Census Area" "Denali Borough" ...
## $ FIPS County Code: int 2013 2020 2050 2068 2070 2090 2105 2110 2122 2130 ...
## $ Urban Rural Code: chr "Noncore" "Medium metro" "Noncore" "Noncore" ...
## $ Covid_death : chr "" "366" "18" "" ...
## $ All_causes_death: chr "14" "4,133" "194" "11" ...
## - attr(*, ".internal.selfref")=<externalptr>
## - attr(*, "sorted")= chr "State"
# Rename some variables
colnames(covid1)[12] <- "urban_rural_code"
colnames(covid1)[10] <-"county_name"
# Check variables
# For two of our variables, they are not in numerical form, so we need to transform it.
covid1$`Covid_death` <- as.numeric(gsub(",","",covid1$`Covid_death`))
covid1$`All_causes_death` <- as.numeric(gsub(",","",covid1$`All_causes_death`))
# Create new variables for total COVID_19 cases in certain State and death in different kind of urban classification.
covid1<-as.data.table(covid1 %>%
group_by(State) %>%
mutate(
total_covid_death_instate = sum(Covid_death)
))
covid1<-as.data.table(covid1 %>%
group_by(State) %>%
mutate(
total_all_caused_death_instate = sum(All_causes_death)
))
covid1<-as.data.table(covid1 %>%
group_by(urban_rural_code) %>%
mutate(
death_mean_urban = mean(Covid_death)
))
# Key Variables
table(covid1$State_full_name)
##
## Alabama Alaska Arizona
## 67 19 15
## Arkansas California Colorado
## 73 56 1
## Connecticut Delaware District of Columbia
## 57 3 1
## Florida Georgia Hawaii
## 67 155 4
## Idaho Illinois Indiana
## 42 101 92
## Iowa Kansas Kentucky
## 99 103 119
## Louisiana Maine Maryland
## 62 15 24
## Massachusetts Michigan Minnesota
## 14 82 84
## Mississippi Missouri Montana
## 81 114 50
## Nebraska Nevada New Hampshire
## 79 16 10
## New Jersey New Mexico New York
## 21 29 61
## North Carolina North Dakota Ohio
## 99 43 88
## Oklahoma Oregon Pennsylvania
## 75 34 67
## Rhode Island South Carolina South Dakota
## 5 46 57
## Tennessee Texas Utah
## 95 245 25
## Vermont Virginia Washington
## 13 131 38
## West Virginia Wisconsin Wyoming
## 52 71 23
summary(covid1$Income)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 45081 54602 58700 59991 61973 86420
summary(covid1$Lon)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.09 34.84 38.46 38.27 41.68 61.37
summary(covid1$Lat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -157.50 -97.56 -89.62 -91.08 -82.76 -69.38
summary(covid1$total_covid_death_instate)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1512 25922 26585 22817 26585 26585 2895
summary(covid1$total_all_caused_death_instate)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 8363 70288 132315 165184 197070 573696 80
# Check missing values:
mean(is.na(covid1$Income))
## [1] 0
mean(is.na(covid1$`Covid_death`))
## [1] 0.195832
mean(is.na(covid1$`All_causes_death`))
## [1] 0.0006615944
# We find that there are missing values in the variable: "Death involving COVID-19" and "Deaths from All Causes", what I will do is to assign value 0 to those missing values.
covid1$`Covid_death`[is.na(covid1$`Covid_death`)] <- 0
covid1$`All_causes_death`[is.na(covid1$`All_causes_death`)] <- 0
We checked the dimension of our data and noticed that there are 3023 total observations and 14 different factors for each of our observation. Then, I did some summaries for the key variables such as Income, Covid-19 death cases and all caused death cases. I found the the lowest median income for people living in certain state is $45081 and highest median income for people living in certain state is $86420. Also, I noticed that the lowest death cases caused by COVID-19 is in Colorado which equals to 0 and highest death cases caused by COVID-19 in California which equals to 73920 and mean death cases caused by COVID-19 in the US is 20504 for any state. From the data visualization, The first plot shows that California, Florida, New York and Texas contains much more COVID-19 death than other states. For the second plot, we noticed that the range of Income between each state is relatively large which equals to 41339, Mississippi with the lowest median income which equals to 45081 and District of Columbia with the highest median income which equals to 86420. The third graph is about the association between different urban-rural classification and COVID-19 death cases. We found that there is a relative positive linear association, as the counties contains more population, the more COVID-19 death cases occurs. The last graph is the scatter plot for the association between Income and COVID-19 death. However, the pattern is not clear and looks like a normal distribution.
tab <- covid1[, .(
Number_of_County =length(unique(county_name)),
Income=unique(Income),
Covid_death=sum(Covid_death),
All_death=sum(All_causes_death)
),
by=State]
datatable(tab)
# Using Usmap to show the number of Covid_19 death cases in each state.
covid1<-as.data.table(covid1 %>%
group_by(State) %>%
mutate(
total_covid_death_instate = sum(Covid_death)
))
covid2<-covid1
colnames(covid2)[1] <- "state"
usmap<-plot_usmap(regions = 'states', data = covid2, values ='total_covid_death_instate', labels = TRUE) +
scale_fill_continuous(low = "white", high = "blue", name="Number of COVID-19 death")
ggplotly(usmap)